Introduction

The NCI-60 cancer cell lines were developed as a drug screening tool focusing on a range of cancer types. In this vignette, we compare two major groups of NCI-60 cancer cell lines(1). The first is the leukemia subgroup consising of 6 cell lines (CCRF-CEM, HL-60 (TB), K-562, MOLT-4, RPMI-8226, SR). The second group is the breast/prostate/ovarian cell lines consisting of 14 total cell lines (BT-549, DU-145, HS 578T, IGROV1, MCF7, MDA-MB-231/ATCC, NCI/ADR-RES, OVCAR-3, OVCAR-4, OVCAR-5, OVCAR-8, PC-3, SK-OV-3, T-47D). The latter cancers are grouped together as they contain common susceptibility loci(2). This vigenette will highlight the analysis we conduct on the NCI-60 cancer cell lines.

Loading in IntLIM and Files

IntLIM, available from Github, can be installed as in the documentation. Once IntLIM is installed, what is necessary is loading in the library.

rm(list = ls())
library(IntLim)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)

For the NCI-60 cell lines, metabolomics and gene expression data were downloadable from the DTP website (https://wiki.nci.nih.gov/display/ncidtpdata/molecular+target+data). The Metabolon data consisting of 353 metabolites and 58 cell lines with 177 technical replicates total was filtered for metabolites that had a median coefficient of variation of below 0.3. The coefficient value was arbitrarily selected to filter out technical replicates having high variability. The resulting metabolite abundance data set of 280 metabolites was subsequently log2 transformed. Probes from the Chiron Affymetrix U133 were mapped to genes using the Ensembl database hgu133.plus.db. Probes mapping to more than one gene were removed. In cases where more than one probe was matching to a given gene, only the probe with the highest mean expression was used for analysis. This resulted in a total of 17,987 genes.

This data has been formatted for IntLIM. We load in the data as follows. The nci60.csv meta file contains a list of phenotypic data file, metabolite data file, gene expression data file, metabolite meta file, and gene expression meta file. The function OutputStats will give a summary of the NCI-60 data

inputData <- IntLim::ReadData('nci60.csv',metabid='id',geneid='id')
## [1] "CreateMultiDataSet created"
IntLim::ShowStats(inputData)
##   Num_Genes Num_Metabolites Num_Samples_withGeneExpression
## 1     17987             280                             57
##   Num_Samples_withMetabolomics Num_Samples_inCommon
## 1                           58                   57

From the OutputStats, we find that we have gene expression data involving 17,987 genes with 57 cell line sample and metabolite abundance data involving 280 metabolites with 58 cell lines.

Filtering Gene Expression and Metabolite Data

We remove genes with mean belows below the 10th percentile. Furthermore, we remove metabolites with more than 80% missing values. This results in gene expression data involving 16188 genes and 57 cell lines and metabolite abundance data involving 220 metabolites and 58 cell lines.

inputDatafilt <- IntLim::FilterData(inputData,geneperc=0.10, metabmiss = 0.80)
## [1] "No metabolite filtering by percentile is applied"
IntLim::ShowStats(inputDatafilt)
##   Num_Genes Num_Metabolites Num_Samples_withGeneExpression
## 1     16188             220                             57
##   Num_Samples_withMetabolomics Num_Samples_inCommon
## 1                           58                   57

We can obtain boxplot distributions of the data as follows. This is used to make figures.

IntLim::PlotDistributions(inputDatafilt, palette = c('black', 'black'))

Principal Component Analysis

The principal component analysis is performed on filtered metabolite and gene expression data to obtain visual representations showing how different sub-sections of the data could be grouped into different clusters. Common samples belonging to either the Leukemia or Breast/Prostate/Ovarian groups are shown. Blue samples indicate leukemia cell lines and red samples indicate BPO. Note the clear delineation of samples.

PlotPCA(inputDatafilt, stype = "cancergroup", common = T)

Running Linear Model

The linear model is for integrating transcriptomic and metabolomics data is: E(m|g,t) = β1 + β2 g + β3 p + β4 (g:p) + ε where ‘m’ and ‘g’ are log-transformed metabolite abundances and gene levels respectively, ‘p’ is phenotype (cancer type, patient diagnosis, treatment group, etc), ‘(g:p)’ is the association between gene expression and phenotype, and ‘ε’ is the error term that is normally distributed. A statistically significant p value of the ‘(g:p)’ association term indicates that the slope relating gene expression and metabolite abundance is different from one phenotype compared to another. We run a linear model on common cell lines of the leukemia (n = 6) and BPO (n = 14) that included 16188 genes and 220 metabolites (total of 3,561,360 possible associations and hence models). For genes and metabolites that had standard deviations of 0 in one of the groups, we assign a p value of NA. The model is run as below by calling RunIntLim(). DistPvalues() allows us to obtain a distribution of p-values for the (g:p) term. We also show a volcano plot showing the relationship between phenotype correlations and FDR adjusted p-values.

myres <- IntLim::RunIntLim(inputDatafilt, stype="cancergroup")
## [1] "Running the analysis on"
## 
##      BPO Leukemia 
##       14        6 
## [1] "10 % complete"
## [1] "20 % complete"
## [1] "30 % complete"
## [1] "40 % complete"
## [1] "50 % complete"
## [1] "60 % complete"
## [1] "70 % complete"
## [1] "80 % complete"
## [1] "90 % complete"
##    user  system elapsed 
## 125.113   1.389 126.603
IntLim::DistPvalues(myres)

IntLim::pvalCorrVolcano(myres, inputDatafilt, diffcorr = 0.5, pvalcutoff = 0.10)
## Warning in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, :
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'

The next step is to process the results of this model by filtering the results of the linear model by FDR-adjusted p-value cutoff (0.10 selected here) for the (g:p) association coefficient and calculate the correlations of the gene-metabolite pairs in each group (BPO and Leukemia) for the filtered results. We further may only interested in results that have an absolute correlation value difference (0.50 selected here). This is done with the ProcessResults function. In addition we also develop a heatmap of the gene-metabolite association correlations for the selected groups.

library(RColorBrewer)
myres <- IntLim::ProcessResults(myres,  inputDatafilt, diffcorr = 0.5, pvalcutoff = 0.10)
IntLim::CorrHeatmap(myres)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
dim(myres@filt.results)
## [1] 1009    6
OutputResults(inputResults = myres, filename = "NCI60pairs.csv")

From this model we find 1009 gene-metabolite correlations that have an association FDR-adjusted p-value of 0.10 and an absolute value correlation difference of 0.5 or greater. The top pairs are shown below.

corr.table <- myres@filt.results
abs.corrdiff <- abs(myres@filt.results$BPO_cor - myres@filt.results$Leukemia_cor)
sort.table <- corr.table[order(-abs.corrdiff),]
sort.table[1:20,]
##                                 metab     gene    BPO_cor Leukemia_cor
## 42                         malic acid  FAM174B  0.8285714   -1.0000000
## 637                            X-1597     DLG4  0.8373626   -0.9276337
## 636                            X-1497     STK4  0.7973588   -0.9428571
## 623       L-beta-imidazolelactic acid     DNER -0.7909295    0.9411239
## 317                            X-4516   DIS3L2  0.7318681   -1.0000000
## 625                     phenylalanine    SUZ12  0.7846154   -0.9428571
## 246                            X-3094     DND1  0.7670330   -0.9428571
## 39                            leucine     DLG4  0.7802198   -0.9276337
## 942                            X-2757   POMZP3 -0.7626374    0.9411239
## 1        (p-Hydroxyphenyl)lactic acid     TYMS  0.8689425   -0.8285714
## 40                         malic acid    FSCN1 -0.7450549    0.9428571
## 639                            X-1616    TBX19 -0.7986804    0.8857143
## 613  L-alpha-glycerophosphorylcholine     PPID  0.6835165   -1.0000000
## 912                            X-2746   PRSS16 -0.7362637    0.9428571
## 1003                           X-5147 KIAA0020 -0.8324444    0.8451543
## 629                            X-1288    OR5V1 -0.8285714    0.8451543
## 225                            X-3094    GSTK1 -0.7846154    0.8857143
## 624                        methionine     DND1  0.7846154   -0.8857143
## 626                     phenylalanine    ACOT9 -0.8417582    0.8285714
## 922                            X-2746    MFSD3 -0.8417582    0.8285714
##              Pval FDRadjPval
## 42   4.115444e-06 0.04017135
## 637  4.659386e-07 0.01437366
## 636  2.977084e-06 0.03476783
## 623  1.480815e-06 0.02589427
## 317  1.128022e-05 0.06191823
## 625  3.957300e-06 0.03978974
## 246  4.412132e-07 0.01428472
## 39   4.057099e-06 0.03992020
## 942  4.853060e-07 0.01465898
## 1    1.746020e-05 0.07283778
## 40   2.370649e-05 0.08244689
## 639  1.082085e-05 0.06121788
## 613  2.108533e-05 0.07870836
## 912  1.609724e-06 0.02699364
## 1003 3.244916e-05 0.09448618
## 629  1.015474e-06 0.02221261
## 225  5.643493e-06 0.04715440
## 624  2.620495e-05 0.08551853
## 626  7.339713e-06 0.05270677
## 922  2.471598e-06 0.03258898

We can show some example plots of some of these pairs. The first example is the FAM174B vs. malic acid. There is a positive correlation for BPO but a negative correlation for Leukemia.

IntLim::PlotGMPair(inputDatafilt, stype = "cancergroup", geneName = "FSCN1", metabName = "malic acid")

Another example is DLG4 vs. leucine.

IntLim::PlotGMPair(inputDatafilt, stype = "cancergroup", geneName = "DLG4", metabName = "leucine")

Another example is DNER vs. L-beta-imidazolelactic acid. There is a positive correlation for Leukemia and a negative correlation for BPO.

IntLim::PlotGMPair(inputDatafilt, stype = "cancergroup", geneName = "DNER", metabName = "L-beta-imidazolelactic acid")

Clustering the Heatmap

We will cut the heatmap into 2 trees. From there we can acquire a list of unique gene and metabolite pairs that we will run in IPA.

# Cut the tree by k = 2
hc.rows<- hclust(dist(myres@filt.results[,c(3,4)]))
ct<- cutree(hc.rows, k=2) 

# Clusters of gene-metaboltie pair table. 
cluster.1 <- myres@filt.results[which(ct == 1), ]
cluster.2 <- myres@filt.results[which(ct == 2), ]

#Sizes of clusters
dim(cluster.1)
## [1] 464   6
dim(cluster.2)
## [1] 545   6

Check for which cluster is which.

summary(cluster.1$Leukemia)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.94110 -0.88040 -0.82630 -0.77140 -0.08571
summary(cluster.2$Leukemia)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.02857  0.77140  0.88040  0.82420  0.94110  1.00000
#Cluster.1 is leukemia correlated and Cluster.2 is leukemia anti-correlated

The gene lists are output for each cluster.

# Unique genes in each cluster
leuk.corr.uniqgene <- unique(cluster.2$gene)
leuk.anti.corr.uniqgene <- unique(cluster.1$gene)

# Write csv files for genes
write.csv(leuk.corr.uniqgene, "leuk.corr.uniqgene.csv")
write.csv(leuk.anti.corr.uniqgene, "leuk.anti.corr.uniqgene.csv")

## Number of unique genes in each cluster
length(leuk.corr.uniqgene)
## [1] 429
length(leuk.anti.corr.uniqgene)
## [1] 356
# Genes that are intersecting
intersect(leuk.corr.uniqgene,leuk.anti.corr.uniqgene )
## character(0)

We also find the Leukemia correlated cluster has 545 gene-metabolite pairs and the lekemia anti-correlated cluster has 464 gene-metabolite pairs. We have 429 unique genes in the leukemia-correlated cluster and 356 unique genes in the leukemia anti-correlated cluster. Interestingly none of these genes are overlapping. These will be output into Ingenuity Pathway Analysis (IPA) (https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis/).

We do the same for metabolites.

# Unique metabolites in each cluster
leuk.corr.uniqmetab <- unique(cluster.2$metab)
leuk.anti.corr.uniqmetab <- unique(cluster.1$metab)
length(leuk.corr.uniqmetab)
## [1] 54
length(leuk.anti.corr.uniqmetab)
## [1] 45
length(intersect(leuk.corr.uniqmetab, leuk.anti.corr.uniqmetab))
## [1] 31

There are 54 unique metabolites in the leukemia correlated cluster and 45 in the anti-correlated cluster. 31 are shared by both.
Lets match the metabolites to the HMDB ID. We have an hmdb.match file from the breast cancer data Metabolon file.

hmdb.match <- read.csv("hmdb.match.csv")

leuk.corr.metab.intersect <- intersect(as.character(leuk.corr.uniqmetab), as.character(hmdb.match$id))
length(leuk.corr.metab.intersect)
## [1] 9
leuk.anti.corr.metab.intersect <- intersect(as.character(leuk.anti.corr.uniqmetab), as.character(hmdb.match$id))
length(leuk.anti.corr.metab.intersect)
## [1] 10

Given that there are few matches (9/54 for the leukemia correlated cluster and 10/54 for the anti-correlated cluster) we decide not to pursue a metabolomics analysis on IPA.

References

  1. Su, G., Burant, C.F., Beecher, C.W., Athey, B.D. and Meng, F. (2011) Integrated metabolome and transcriptome analysis of the NCI60 dataset. BMC bioinformatics, 12, S36.

  2. Kar, S.P., Beesley, J., Al Olama, A.A., Michailidou, K., Tyrer, J., Kote-Jarai, Z., Lawrenson, K., Lindstrom, S., Ramus, S.J. and Thompson, D.J. (2016) Genome-wide meta-analyses of breast, ovarian, and prostate cancer association studies identify multiple new susceptibility loci shared by at least two cancer types. Cancer discovery, 6, 1052-1067.